PERSON TAGGING IN AN IMAGE PROCESSING SYSTEM
UTILIZING A STATISTICAL MODEL BASED ON BOTH APPEARANCE AND GEOMETRIC
FEATURES

Field of the Invention

The present invention relates generally to the field of image 
processing, and more particularly to techniques for implementing a 
person tagging feature in an image processing system. 

Background of the Invention

Person tagging in image processing systems refers generally to 
the process of characterizing a person observed in an image or 
sequence of images of a video signal, and using the characterization 
to determine if the same person is present in one or more subsequent 
images. A detected person is "tagged" by association with the
characterization, and can thereby be identified as the tagged person
in subsequent images. The process of person tagging is thus distinct
from a person recognition process in that it does not necessarily
involve definitive identification of a given person as being a
particular known individual. Instead, it simply generates an
indication that a person in a current image is believed to match a
person detected in a previous image. The person tagging process is
also referred to as person matching.

Conventional person tagging generally involves the use of either
appearance-based or geometry-based detection algorithms. The
appearance-based algorithms include techniques such as template
matching and color histograms. Examples of features used in
geometry-based algorithms include size, shape, etc. The conventional
techniques, however, have been unable to combine appearance and
geometric features in a manner which provides more efficient and
effective person tagging for an image processing system.

Summary of the Invention 

The present invention solves the above-noted problem of
conventional person tagging techniques by providing a method and
apparatus in which appearance features and geometric features are both
incorporated into a statistical model of a particular tagged person.
The statistical models generated for a given set of persons present in
images of a given video segment or other image sequence may be used
for detection, location and tracking of the persons in
subsequently-processed images.

US000273

In accordance with one aspect of the invention, an image 
processing system processes a sequence of images to generate a
statistical model for each of a number of different persons to be
tagged so as to be identifiable in subsequent images. The statistical 
model for a given tagged person incorporates at least one appearance 
feature, such as color, texture, etc., and at least one geometric 
feature, such as shape or position of a designated region of similar
appearance within one or more images. The models are applied to
subsequent images in order to perform a person detection, person 
location and/or person tracking operation. An action of the image 
processing system is controlled based on a result of the operation. 

In accordance with another aspect of the invention, the
statistical model for a given tagged person may be generated by
separating one or more images into a number N of different regions of
similar appearance.

In accordance with a further aspect of the invention, the 
statistical model generated for a given person may be in the form of
a likelihood probability function which indicates the likelihood that
the person is present in a given image or set of images.

As noted previously, a significant advantage of the present 
invention is that it utilizes statistical models which incorporate 
both appearance and geometric features. The use of models which
combine these different types of features significantly improves the
performance of the person tagging process. For example, such an
approach ensures that the system will be less likely to confuse
persons crossing one another or persons partially occluded by other
objects in a given image sequence.

The present invention can be used in a wide variety of image 
processing applications, such as video conferencing systems, video 
surveillance and monitoring systems, and human-machine interfaces. 

Brief Description of the Drawings 

FIG. 1 is a block diagram of an image processing system in which
the present invention may be implemented. 

FIG. 2 illustrates an example person tagging process in 
accordance with the present invention. 

FIG. 3 illustrates a translation operation that may be utilized 
in a person tagging process in accordance with the present invention. 

FIG. 4 is a flow diagram of an example person tagging process in 
accordance with the present invention. 



Detailed Description of the Invention 

FIG. 1 shows an image processing system 10 in which person 
tagging techniques in accordance with the invention may be 
implemented. The system 10 includes a processor 12, a memory 14, an 
input/output (I/O) device 15 and a controller 16, all of which are 
connected to communicate over a set 17 of one or more system buses or 
other type of interconnections. The system 10 further includes a 
camera 18 that is coupled to the controller 16 as shown. The camera 18 
may be, e.g., a mechanical pan-tilt-zoom (PTZ) camera, a wide-angle 
electronic zoom camera, or any other suitable type of image capture
device. It should therefore be understood that the term "camera" as 
used herein is intended to include any type of image capture device as 
well as any configuration of multiple such devices. 

The system 10 may be adapted for use in any of a number of 
different image processing applications, including, e.g., video 
conferencing, video surveillance, human-machine interfaces, etc. More 
generally, the system 10 can be used in any application that can 
benefit from the improved person tagging capabilities provided by the
present invention.

In operation, the image processing system 10 generates a video 
signal or other type of sequence of images of a person 20. The camera 
18 may be adjusted such that the person 20 comes within a field of 
view 22 of the camera 18. A video signal corresponding to a sequence 
of images generated by the camera 18 is then processed in system 10 
using the person tagging techniques of the invention, as will be 
described in greater detail below. An output of the system may then be 
adjusted based on the detection of a particular tagged person in a 
given sequence of images. For example, a video conferencing system, 
human-machine interface or other type of system application may 
generate a query or other output or take another type of action based 
on the detection of a tagged person. Any other type of control of an 
action of the system may be based at least in part on the detection of
a tagged person.

Elements or groups of elements of the system 10 may represent
corresponding elements of an otherwise conventional desktop or
portable computer, as well as portions or combinations of these and
other processing devices. Moreover, in other embodiments of the
invention, some or all of the functions of the processor 12, memory
14, controller 16 and/or other elements of the system 10 may be
combined into a single device. For example, one or more of the
elements of system 10 may be implemented as an application specific
integrated circuit (ASIC) or circuit card to be incorporated into a
computer, television, set-top box or other processing device.

The term "processor" as used herein is intended to include a 
microprocessor, central processing unit (CPU), microcontroller, 
digital signal processor (DSP) or any other data processing element 
that may be utilized in a given image processing system. In addition, 
it should be noted that the memory 14 may represent an electronic 
memory, an optical or magnetic disk-based memory, a tape-based memory, 
as well as combinations or portions of these and other types of 
storage devices.

The present invention provides improvements over conventional
person tagging techniques through the use of statistical models based
on both appearance features and geometric features. The term "tagging"
as used herein refers generally to the generation of a statistical
model characterizing a particular person in one or more images of a
given image sequence. A person that has been "tagged" in this manner 
can then be detected, located and/or tracked in one or more subsequent 
images of the same sequence or of another sequence. 

FIG. 2 illustrates an example of a person tagging process in
accordance with the present invention. An image 25 which includes
person 20 is generated and processed in system 10 such that the image
is segmented into a number N of different regions of similar
appearance. The index r is used to identify a particular one of the
regions. In this example, the image 25 is segmented into a total of
N = 3 different regions corresponding to portions 26-1, 26-2 and 26-3
of the original image 25. $P(I \mid \Omega)$ denotes the likelihood
probability function of a statistical model generated for a given
person $\Omega$, and indicates the likelihood that the person $\Omega$
is present in a given image $I$. The likelihood probability function
$P(I \mid \Omega)$ of the statistical model for person $\Omega$ may be
computed as

$$P(I \mid \Omega) = \sum_{r=1}^{N} P(R_r \mid \Omega)\, P(r \mid \Omega),$$

where $R_r$ is a function of at least one appearance feature and at least
one geometric feature. The appearance features may include color,
texture, etc., and the geometric features may include region shape as
well as relative region position within the image.
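
As a worked illustration of the sum over regions, the following Python sketch evaluates the likelihood for N = 3 regions. The function name `image_likelihood` and all numeric values are hypothetical and chosen only for illustration; the per-region terms are assumed to have been computed already.

```python
def image_likelihood(region_likelihoods, region_priors):
    """Return the sum over regions r of P(R_r | person) * P(r | person)."""
    if len(region_likelihoods) != len(region_priors):
        raise ValueError("one likelihood and one prior are needed per region")
    return sum(l * p for l, p in zip(region_likelihoods, region_priors))


# Hypothetical values for the N = 3 regions of FIG. 2.
likelihoods = [0.8, 0.5, 0.1]   # P(R_r | person), assumed given
priors = [0.5, 0.3, 0.2]        # P(r | person), assumed to sum to 1
score = image_likelihood(likelihoods, priors)
```

With these illustrative numbers the three terms are 0.40, 0.15 and 0.02, so a region that matches well and has a high prior dominates the score.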

The general person tagging process illustrated in FIG. 2 involves 
building statistical models of persons from one or more images and 
using those models for detection and location of the tagged persons in 
subsequent images.

The process can also be configured to provide tracking of a
tagged person, as will now be described in detail in conjunction with
FIG. 3. Let $P(I \mid T, \xi, \Omega)$ be the likelihood probability
function of the statistical model of the person $\Omega$. $T$ is a linear
transformation used to capture global motion of the person in the image
space, and $\xi$ is a discrete variable introduced to capture the state
of the local motion of the person at a given point in time, where the
term "local motion" is intended to include articulated motion, i.e., the
relative motion of different parts of a whole. For example, the position
of a person in a room can be obtained from the linear transformation $T$,
while the pose of the person (standing, sitting, etc.) can be
determined from the discrete variable $\xi$.

FIG. 3 illustrates the operation of the linear transformation T. 
As shown in the figure, the linear transformation T is used to obtain 
a sub-window 30 of the image I that is invariant to rotation and 
scale. It may be implemented, for example, using a bilinear 
interpolation technique with a reference point $x_c$ in the input image
$I$, a rotation angle $\theta$, and a scaling factor $s$.
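
One way such a sub-window could be extracted is sketched below in Python. This is a minimal illustration and not necessarily the implementation contemplated here: the function names, the grayscale list-of-rows image format, the zero fill outside the image, and the reading of the scaling factor as "input pixels per window pixel" are all assumptions.

```python
import math


def bilinear(image, x, y):
    """Sample a grayscale image (list of rows) at real-valued (x, y); 0.0 outside."""
    h, w = len(image), len(image[0])
    if x < 0 or y < 0 or x > w - 1 or y > h - 1:
        return 0.0
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * image[y0][x0] + ax * image[y0][x1]
    bottom = (1 - ax) * image[y1][x0] + ax * image[y1][x1]
    return (1 - ay) * top + ay * bottom


def extract_subwindow(image, center, theta, scale, size):
    """Resample a size x size window centered on `center`, rotated by theta (radians)."""
    cx, cy = center
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half = (size - 1) / 2.0
    window = []
    for v in range(size):
        row = []
        for u in range(size):
            # Offset inside the normalized window, mapped back into the input image.
            du, dv = (u - half) * scale, (v - half) * scale
            x = cx + cos_t * du - sin_t * dv
            y = cy + sin_t * du + cos_t * dv
            row.append(bilinear(image, x, y))
        window.append(row)
    return window


# Hypothetical 5 x 5 test image whose value encodes its own coordinates.
image = [[x + 10 * y for x in range(5)] for y in range(5)]
window = extract_subwindow(image, (2.0, 2.0), 0.0, 1.0, 3)
```

Because every candidate transformation resamples the person into the same fixed-size window, the model can be evaluated in one normalized coordinate frame regardless of where the person stands or how large the person appears.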

The above-noted local motion is modeled using a discrete set of
states $\{\xi_1, \xi_2, \ldots, \xi_M\}$ of the variable $\xi$ to capture
M different poses of the person $\Omega$.

The detection and location of the person $\Omega$ in the image $I$ in the
person tagging process of the invention may be implemented using the
following maximum likelihood search:

$$T^{*} = \arg\max_{T} \sum_{\xi} P(I \mid T, \xi, \Omega)\, P(\xi \mid \Omega).$$
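
The search can be sketched in Python as an exhaustive scan over a finite set of candidate transformations. This sketch assumes the search maximizes, over T, the likelihood summed over poses and weighted by the pose prior; `locate_person`, `toy_likelihood`, the tuple encoding of T and all numbers are hypothetical stand-ins, not part of the described system.

```python
def locate_person(image, transforms, poses, likelihood, pose_prior):
    """Return (T*, score), where T* maximizes sum over poses of P(I|T,pose) * P(pose)."""
    best_T, best_score = None, float("-inf")
    for T in transforms:
        score = sum(likelihood(image, T, pose) * pose_prior[pose] for pose in poses)
        if score > best_score:
            best_T, best_score = T, score
    return best_T, best_score


# Toy stand-in for the model likelihood (hypothetical, for illustration only).
def toy_likelihood(image, T, pose):
    position_term = 1.0 / (1.0 + abs(T[0] - image["true_x"]))
    pose_term = 0.9 if pose == image["true_pose"] else 0.1
    return position_term * pose_term


image = {"true_x": 1, "true_pose": "sitting"}
transforms = [(0, 0.0, 1.0), (1, 0.0, 1.0), (2, 0.0, 1.0)]  # (x offset, angle, scale)
pose_prior = {"standing": 0.6, "sitting": 0.4}
best_T, best_score = locate_person(
    image, transforms, list(pose_prior), toy_likelihood, pose_prior
)
```

A practical system would likely restrict or coarsely sample the candidate transformations, since the cost grows with the number of candidates times the number of poses.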

Tracking a tagged person, in contrast to detection and location, 
takes advantage of the history of the known positions and poses of the 
person from previous images, e.g., previous frames of a given video 
segment. For a video segment $V_t = \{I_0, I_1, \ldots, I_t\}$, the
likelihood probability
$P(V_t \mid T_t, \xi_t, T_{t-1}, \xi_{t-1}, \ldots, T_0, \xi_0, \Omega)$
is maximized to obtain the optimum trajectory of the person
$\{T_0^{*}, \xi_0^{*}, T_1^{*}, \xi_1^{*}, \ldots, T_t^{*}, \xi_t^{*}\}$.
This maximum likelihood search provides tracking of a tagged person, and
can be efficiently implemented using well-known conventional techniques
such as the Viterbi algorithm or a forward-backward algorithm.

The likelihood probability of a video sequence can be written in 
terms of the likelihood probability of individual frames as follows: 



$$P(V_t \mid T_t, \xi_t, T_{t-1}, \xi_{t-1}, \ldots, T_0, \xi_0, \Omega) =
P(I_t \mid T_t, \xi_t, \Omega)\, P(T_t \mid T_{t-1}, \ldots, T_0)\, P(\xi_t \mid \xi_{t-1}, \Omega),$$

where $P(T_t \mid T_{t-1}, \ldots, T_0)$ characterizes the global motion model
and could be implemented using, e.g., a Kalman filter, and
$P(\xi_t \mid \xi_{t-1}, \Omega)$ characterizes local motion, and could be
implemented as a first order Markov model using a transition matrix.
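
The local-motion part of this search can be illustrated with a small Viterbi decoder over the discrete pose states. The Python sketch below is a simplified illustration only: it assumes the per-frame likelihoods of each pose are already available (for example, evaluated at the best transformation for that frame), it ignores the global motion term, and the names and numbers are hypothetical.

```python
import math


def viterbi(frame_likelihoods, transition, initial):
    """Most likely pose sequence under a first-order Markov model.

    frame_likelihoods[t][s]: likelihood of frame t given pose s
    transition[p][s]:        probability of moving from pose p to pose s
    initial[s]:              probability of pose s in the first frame
    All probabilities are assumed to be strictly positive.
    """
    n = len(initial)
    score = [math.log(initial[s]) + math.log(frame_likelihoods[0][s]) for s in range(n)]
    back = []
    for obs in frame_likelihoods[1:]:
        prev, score, pointers = score, [], []
        for s in range(n):
            # Best predecessor pose for arriving in pose s at this frame.
            best_p = max(range(n), key=lambda p: prev[p] + math.log(transition[p][s]))
            pointers.append(best_p)
            score.append(prev[best_p] + math.log(transition[best_p][s]) + math.log(obs[s]))
        back.append(pointers)
    # Trace the best final pose back to the first frame.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical example with two poses: 0 = standing, 1 = sitting.
transition = [[0.9, 0.1], [0.1, 0.9]]
initial = [0.6, 0.4]
frame_likelihoods = [[0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]]
poses = viterbi(frame_likelihoods, transition, initial)
```

Working in the log domain avoids numerical underflow when many per-frame probabilities are multiplied along a long video segment.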

In accordance with the invention, different statistical models of 
the type described above are generated, one for each person present in 
a given video segment or other type of image sequence. The person 
tagging process can then provide detection, location and tracking by 
associating the trajectory of each tagged person with an identifier of 
the best matching model. 
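
Selecting the best matching model can be sketched as an argmax over the stored models. In the Python sketch below each model is represented, purely for illustration, as a callable that returns a likelihood for an observation; `best_matching_tag`, the tag names and the toy models are hypothetical.

```python
def best_matching_tag(observation, models):
    """Return (tag, score) of the stored model that scores the observation highest."""
    scores = {tag: model(observation) for tag, model in models.items()}
    tag = max(scores, key=scores.get)
    return tag, scores[tag]


# Hypothetical stand-ins for two stored person models.
models = {
    "person_A": lambda obs: 0.9 if obs["shirt"] == "red" else 0.1,
    "person_B": lambda obs: 0.8 if obs["shirt"] == "blue" else 0.2,
}
tag, score = best_matching_tag({"shirt": "blue"}, models)
```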

As noted previously, a significant advantage of the present 
invention is that it utilizes statistical models which incorporate 
both appearance and geometric features. The use of models which 
combine these different types of features significantly improves the 
performance of the person tagging process. For example, it ensures 
that the system will be less likely to confuse persons crossing one 
another or persons partially occluded by other objects in the sequence 
of video frames. 

The generation of the statistical models based on both appearance 
and geometric features will now be described in greater detail. For 
simplicity and clarity of illustration, the pixels in an image $I$ of a
person $\Omega$ may be considered independent from one another. In other
words,

$$P(I \mid T, \xi, \Omega) = \prod_{\mathrm{pix} \in I} P(\mathrm{pix} \mid T, \xi, \Omega).$$

As previously noted in conjunction with FIG. 2, r is an index to
regions of similar appearance and N is the total number of such
regions, r = 1, 2, ..., N, so that:

$$P(\mathrm{pix} \mid T, \xi, \Omega) = \max_{r = 1, \ldots, N} \left[ P(\mathrm{pix} \mid r, T, \xi, \Omega)\, P(r \mid \xi, \Omega) \right],$$

where $P(\mathrm{pix} \mid r, T, \xi, \Omega)$ is the probability of observing the
pixel pix assuming that it belongs to the r-th region of the person's
model on that pose, and $P(r \mid \xi, \Omega)$ is the prior probability of the
region at that pose. In order to handle occlusions and new exposures, a
dummy region may be added with a constant probability as follows:

$$P(\mathrm{pix} \mid r_{\mathrm{occlusion}}, T, \xi, \Omega)\, P(r_{\mathrm{occlusion}} \mid \xi, \Omega) = P_{\mathrm{occlusion}}.$$

Every pixel in the image may be characterized by its position $x$
(a two-dimensional vector), and by its appearance features $f$ (color,
texture, etc.), so that:

$$P(\mathrm{pix} \mid r, T, \xi, \Omega) = P(x \mid r, T, \xi, \Omega)\, P(f \mid r, T, \xi, \Omega),$$

where $P(x \mid r, T, \xi, \Omega)$ and $P(f \mid r, T, \xi, \Omega)$ may both be
approximated as Gaussian distributions over their corresponding feature
spaces. The above-noted appearance feature vector $f$ can be obtained for
a given pixel from the pixel itself or from a designated "neighborhood"
of pixels around the given pixel. As previously noted, examples of such
appearance features include color and texture. Color features may be
determined in accordance with parameters of well-known color spaces
such as RGB, HSI, CIE, etc. The texture features may be obtained using
well-known conventional techniques such as edge detection, texture
gradients, Gabor filters, Tamura feature generation, etc.
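
The per-pixel model can be illustrated with the following Python sketch. It is a minimal illustration under several assumptions that are not stated in the text: the Gaussians are taken to have diagonal covariance, pixel positions are assumed to be already expressed in the sub-window coordinates produced by the transformation, one list of regions is assumed per pose, and the dictionary keys, function names and numeric values are hypothetical.

```python
import math


def gaussian(value, mean, var):
    """Density of a diagonal-covariance Gaussian evaluated at `value`."""
    density = 1.0
    for v, m, s2 in zip(value, mean, var):
        density *= math.exp(-0.5 * (v - m) ** 2 / s2) / math.sqrt(2.0 * math.pi * s2)
    return density


def pixel_likelihood(x, f, regions, p_occlusion):
    """Max over regions of P(x|r) * P(f|r) * P(r), with the occlusion constant as a floor."""
    best = p_occlusion  # dummy occlusion region with constant probability
    for reg in regions:
        p = (gaussian(x, reg["x_mean"], reg["x_var"])
             * gaussian(f, reg["f_mean"], reg["f_var"])
             * reg["prior"])
        best = max(best, p)
    return best


def image_log_likelihood(pixels, regions, p_occlusion):
    """Sum of per-pixel log-likelihoods, i.e. the log of the product over pixels."""
    return sum(math.log(pixel_likelihood(x, f, regions, p_occlusion)) for x, f in pixels)


# Hypothetical single-region model for one pose: position (2-D) and one appearance feature.
regions = [{"x_mean": (0.0, 0.0), "x_var": (1.0, 1.0),
            "f_mean": (0.5,), "f_var": (0.01,), "prior": 1.0}]
p_occlusion = 1e-6
near = pixel_likelihood((0.0, 0.0), (0.5,), regions, p_occlusion)
far = pixel_likelihood((100.0, 100.0), (0.5,), regions, p_occlusion)
total = image_log_likelihood([((0.0, 0.0), (0.5,)), ((100.0, 100.0), (0.5,))],
                             regions, p_occlusion)
```

The occlusion constant keeps a pixel that no region explains from driving the whole product to zero, which is what allows a partially hidden person to still score well.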

FIG. 4 is a flow diagram summarizing the above-described person 
tagging process of the present invention. In step 40, a video segment 
or other type of image sequence is processed to generate an appearance 
and geometry based statistical model $P(I \mid T, \xi, \Omega)$ for each person $\Omega$ to
be tagged. In step 42, the resulting model or set of models is stored 
in a memory of the image processing system, e.g., in memory 14 of 
system 10. Finally, in step 44, one or more subsequent images are 
processed using the stored models to perform at least one of a person 
detection, person location and person tracking operation. The one or 
more subsequent images may be subsequent images from the same video 
segment or other image sequence, or from a different image sequence.
The processing operations of steps 40, 42 and 44 may be carried out
using software executed by processor 12 of system 10.

The above-described embodiments of the invention are intended to
be illustrative only. For example, the techniques of the invention
can be implemented using a variety of different person tagging
processes, including processes involving any one or more of person
detection, person location and person tracking. In addition, the
invention can be used to provide person tagging capability in a wide
variety of applications, including video conferencing systems, video
surveillance systems, and other camera-based systems. Furthermore, the
invention can be implemented at least in part in the form of one or 
more software programs which are stored on an electronic, magnetic or 
optical storage medium and executed by a processing device, e.g., by 
the processor 12 of system 10. These and numerous other embodiments 
within the scope of the following claims will be apparent to those 
skilled in the art.

Claims

What is claimed is: 

1. A method of person tagging in an image processing system, the 
method comprising the steps of: 

processing a sequence of images to generate a statistical 
model for each person to be tagged, the statistical model 
incorporating at least one appearance feature and at least one 
geometric feature of the tagged person; 

applying the model to at least one subsequent image in order 
to perform at least one of a detection operation, a location operation 
and a tracking operation for the tagged person; and 

controlling an action of the image processing system based 
on a result of the at least one operation. 

2. The method of claim 1 wherein the sequence of images
comprises a video segment.

3. The method of claim 1 wherein the processing step further
includes processing the sequence of images to generate a plurality of
statistical models, each of the models corresponding to a particular 
tagged person. 

4. The method of claim 1 wherein the appearance feature 
comprises at least one of a color feature and a texture feature. 

5. The method of claim 1 wherein the geometric feature comprises 
at least one of a region shape and a region position for a given one 
of a plurality of regions associated with the statistical model. 

6. The method of claim 1 wherein the statistical model is 
generated at least in part by segmenting a given image into a number 
N of different regions of similar appearance.

7. The method of claim 1 wherein the statistical model generated
for a given person $\Omega$ comprises a likelihood probability function
$P(I \mid \Omega)$ which indicates the likelihood that the person $\Omega$ is
present in a given image $I$.

8. The method of claim 7 wherein the likelihood probability
function $P(I \mid \Omega)$ for person $\Omega$ is computed as

$$P(I \mid \Omega) = \sum_{r=1}^{N} P(R_r \mid \Omega)\, P(r \mid \Omega),$$

where $R_r$ is a function of the at least one appearance feature and the
at least one geometric feature, and r is an index identifying one of
N regions of similar appearance within the image $I$.

9. The method of claim 1 wherein the statistical model generated
for a given person $\Omega$ comprises a likelihood probability function
$P(I \mid T, \xi, \Omega)$, where $T$ is a linear transformation used to capture
global motion of the person in an image $I$, and $\xi$ is a discrete variable
used to capture local motion of the person at a given point in time.

10. The method of claim 9 wherein a location of the person is 
determined using the linear transformation T. 

11. The method of claim 9 wherein a pose of the person is 
determined using the discrete variable $\xi$.

12. The method of claim 9 wherein the linear transformation T 
is used to obtain a sub-window of the image I that is invariant to 
rotation and scale. 

13. The method of claim 9 wherein the linear transformation T is
implemented using a bilinear interpolation technique with a reference
point $x_c$ in the image I, a rotation angle $\theta$, and a scaling factor s.

14. The method of claim 9 wherein the local motion is modeled
using a discrete set of states $\{\xi_1, \xi_2, \ldots, \xi_M\}$ of the
variable $\xi$ to capture M different poses of the person $\Omega$.

15. The method of claim 1 wherein the statistical model
generated for a given person $\Omega$ and image $I$ comprises a likelihood
probability function

$$P(I \mid T, \xi, \Omega) = \prod_{\mathrm{pix} \in I} P(\mathrm{pix} \mid T, \xi, \Omega),$$

where r is an index to regions of similar appearance and N is a total
number of such regions, r = 1, 2, ..., N, and

$$P(\mathrm{pix} \mid T, \xi, \Omega) = \max_{r = 1, \ldots, N} \left[ P(\mathrm{pix} \mid r, T, \xi, \Omega)\, P(r \mid \xi, \Omega) \right],$$

where $P(\mathrm{pix} \mid r, T, \xi, \Omega)$ is the probability of observing pixel pix
assuming that it belongs to an r-th region of the model on a pose $\xi$,
and $P(r \mid \xi, \Omega)$ is the prior probability of the region at that pose.

16. The method of claim 15 wherein the regions of similar
appearance include a dummy region having a constant probability as
follows:

$$P(\mathrm{pix} \mid r_{\mathrm{occlusion}}, T, \xi, \Omega)\, P(r_{\mathrm{occlusion}} \mid \xi, \Omega) = P_{\mathrm{occlusion}}.$$



17. The method of claim 15 wherein each of at least a subset of
the pixels of the image $I$ is characterized by a two-dimensional
position vector $x$ and by an appearance feature vector $f$ such that:

$$P(\mathrm{pix} \mid r, T, \xi, \Omega) = P(x \mid r, T, \xi, \Omega)\, P(f \mid r, T, \xi, \Omega),$$

where $P(x \mid r, T, \xi, \Omega)$ and $P(f \mid r, T, \xi, \Omega)$ are approximated
as Gaussian distributions over corresponding feature spaces.

18. The method of claim 1 wherein the controlling step comprises 
generating an output of the image processing system based on the 
result of the at least one operation. 

19. The method of claim 1 wherein the controlling step comprises 
altering an operating parameter of the image processing system based 
on the result of the at least one operation. 

20. An apparatus for use in providing person tagging in an image 
processing system, the apparatus comprising: 

a processor operative to process a sequence of images to 
generate a statistical model for each person to be tagged, the 
statistical model incorporating at least one appearance feature and at 
least one geometric feature of the tagged person, the processor being 
further operative to apply the model to at least one subsequent image 
in order to perform at least one of a detection operation, a location 
operation and a tracking operation for the tagged person, and further 
wherein an action of the image processing system is controlled based 
on a result of the at least one operation. 

21. An article of manufacture comprising a storage medium for 
storing one or more programs for use in providing person tagging in an 
image processing system, wherein the one or more programs when 
executed by a processor implement the steps of: 

processing a sequence of images to generate a statistical
model for each person to be tagged, the statistical model
incorporating at least one appearance feature and at least one 
geometric feature of the tagged person; and 

applying the model to at least one subsequent image in order 
to perform at least one of a detection operation, a location operation 
and a tracking operation for the tagged person; 

wherein an action of the image processing system is 
controlled based on a result of the at least one operation.

Abstract

An image processing system processes a sequence of images to 
generate a statistical model for each of a number of different persons 
to be tagged so as to be identifiable in subsequent images. The 
statistical model for a given tagged person incorporates at least one 
appearance feature, such as color, texture, etc., and at least one 
geometric feature, such as shape or position of a designated region of 
similar appearance within one or more images. The models are applied 
to subsequent images in order to perform a person detection, person 
location and/or person tracking operation. An action of the image 
processing system is controlled based on a result of the operation.

c/o U.S. PHILIPS CORPORATION, Intellectual Property Department, 580
White Plains Road, Tarrytown, New York 10591, his Associate
Attorney(s)/Agent(s) with all the usual powers to prosecute the
above-identified application and any division or continuation
thereof, to make alterations and amendments therein, and to
transact all business in the Patent and Trademark Office connected
therewith.

ALL CORRESPONDENCE CONCERNING THIS APPLICATION AND THE 
LETTERS PATENT WHEN GRANTED SHOULD BE ADDRESSED TO THE UNDERSIGNED 
ATTORNEY OF RECORD. 

Respectfully,

E. Haken, Reg. 26,902
Attorney of Record

Dated at Tarrytown, New York
this October 26, 2000